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Abstract 

A variety of dimension reduction methods can be framed as (generalized) eigenvalue decomposition prob- 
lems. For these methods we develop an algorithmic framework for inference of low-dimensional structure 
from massive high-dimensional data by leveraging recently developed highly scalable randomized low-rank 
approximation approaches. We propose efficient randomized algorithms for supervised and unsupervised 
(non)linear dimension reduction - specifically, (localized) Sliced Inverse Regression (SIR), Locality Pre- 
serving Projections (LPP), and Principal Components Analysis (PCA). A key point in our development is 
the emphasis on the dual role of randomization as simultaneously enhancing the computational feasibility 
while introducing regularization in the dimension reduction estimators. This point is highlighted by results 
on simulated and real data that show improved performance for the randomized over the exact methods. 
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1 Introduction 



In the current era of information, large amounts of complex high dimensional data are routinely generated 
across science and engineering. Understanding the underlying structure in the data and using this structure 
to model scientific problems is of fundamental importance in a variety of applications. As the size of data 
sets increases the problem of statistical inference and computational feasibility become inextricably linked. 
The idea of dimension reduction is a natural approach to summarize massive data and has historically played 
a central role in data visualization, predictive modeling and general data analysis in both statistical inference 
( |Adcock| [T8781 |Edegworth| [1884} |Fisher| [7922} |Hotelling| [T933] |Young| [T94T) and numerical analysis 
( |Golub|[T969l|Golub and Van Loan|[T996l|Gu and Eisenstat|[T996]|Golub et aL|[2000l >, for a recent review 
see |Mah oney (2011 1. Historically, statisticians have focused on procedures and theory for estimators, for 
example the subspace that captures the maximal variation in data drawn from a population. Numerical 
analysis, or computational mathematics, is then used to provide efficient algorithms, with provable stability 
and convergence guarantees, to compute these estimates. A classic example of this interplay is Principal 
Components Analysis (PC A) (Hotelling, 1933). In PCA a target estimator is defined based on statistical 
considerations about the sample variance in the data, which is then efficiently computed using a variety 
of Singular Value Decomposition (SVD) algorithms developed by the numerical analysis community. In 
this paper we focus on the problem of dimension reduction integrating the statistical considerations of 
estimation accuracy and out-of-sample errors with the numerical considerations of runtime and numerical 
accuracy. Given a very large high-dimensional data set a natural approach is to try and compress it with 
minimal loss of information prior to statistical analysis. In this paper we focus on dimension reduction as a 
means of compression and out-of-sample residual errors as a measure of information. This idea underlies 
a variety of approaches to linear and non-linear dimension reduction both in the unsupervised and the 
supervised setting. When the goal is prediction of a quantitative response, dimension reduction is often 
coupled with regression. Many dimension reduction methods ( Hotelling 1933;Fisher, 1936;Li, 1991.WU 
|eTaLl|20T0"{|Belkin and Niyogi|[2003)|Donoho and Grimes||200"3"{|Roweis and S aul 2000) can be reduced 
to the solution of generalized eigendecomposition problems. These can be efficiently solved when either 
the number of observations, n, or the number of dimensions, p, is small. However, when both dimensions 
are large the typical computational cost of 0{n 2 p) of current algorithms that rely on eigendecomposition of 
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the data matrix becomes prohibitive. The numerical analysis community has recently developed powerful 
randomized algorithms for approximate solutions to eigendecomposition problems (Drineas et al. 2006; 
|Sarlos|[2006l|Liberty et al.1|20071|Boutsidis et al.||2009"l|Rokhlin etaT| [20081 |Halko et al.||2009> , including 
applications to efficient implementation of PCA. Analysis of these randomized algorithms has focused on 
the difference between the approximate and the exact decomposition of the observed data matrix. This 
misses the very important point that the observed data is a random realization from a population and our 
interest is in a comparison of the approximate decomposition of the observed data to that of the population. 
From this perspective the approximate decomposition not only plays an important computational role but 
also has an impact on the inference properties of the resulting estimators. One objective of this paper is to 
demonstrate that randomization serves as a regularizer and can help avoid over-fitting to the noise in the 
observed data. In this paper we will develop novel randomized algorithms for dimension reduction that scale 
to massive data and discuss the relation between randomization and the classic notions of regularization 
or shrinkage. We will focus on key parameters in these numerical algorithms which have both runtime 
implications as well as a natural interpretation as regularizers of the final estimators. Specifically, we 
develop randomized algorithms for a variety of dimension reduction methods including: PCA as a linear 
unsupervised method, Sliced Inverse Regression (SIR) ( Li 1991 1 as a linear supervised method, Localized 
Sliced Inverse Regression (LSIR) (Wu et al]j|2010) as a supervised non-linear dimension reduction method, 
and Locality Preserving Projections (LPP) l |He and Niyogi, 20031 a non-linear unsupervised method. The 
framework used to develop these randomized algorithms is general and can be extended to a variety of 
dimension reduction methods ( Tenenbaum et al. 2000 , Roweis and Saul , 2000 , Donoho and Grimes , 2003 , 
|Belkin and Niyogi| [2003] |Li) [I99T| [Cook and Weisberg] [T99Tj |Li[ [19921 |Hastie and Tibshirani| [1996} 
|Goldberger eTaLl[2005||Globerson and Roweis|[2006]|Li et al.||2005)|Nilsson et al.|[2007l|Sugiyama|[2007l 
[Cookl [20071 [Wu et al.||2010) . 

2 Randomized algorithms for dimension reduction 

We will develop randomized algorithms for four dimension reduction methods: PCA, SIR, LSIR, and 
LPP. The first algorithm will provide a numerically efficient and statistically robust estimate of the prin- 
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cipal components from data based on a randomized algorithm for singular value decomposition (RSVD) 
(Rokhlin et al. 2008, Halko et al. 2009) . In this problem the objective is linear unsupervised dimension 
reduction with the low-dimensional subspace estimated via an eigendecomposition. RSVD will serve as the 
core computational engine for the other three dimension reduction algorithms in which estimation reduces 
to a generalized eigendecomposition problem. The second algorithm computes a low dimensional linear 
subspace that captures the predictive information in the data. This is a supervised setting, in which the 
input data consist of a set of features and a response variable for each observation. We decided to focus 
on SIR ( Li 1991 ) as it applies to both categorical and continuous responses and subsumes the widely used 
Linear Discriminants Analysis (LDA) ( Fisher 1936 1 as a special case. The third method to which we apply 
randomization ideas is LSIR ( |Wu et alT| |20 10^ , which is an extension of SIR to the non-linear supervised 
dimension reduction setting. Finally, we develop a randomized algorithm for unsupervised manifold learn- 
ing ( |Tenenbaum eTaL|[2000irRoweis and Saul] [2000) [Donoho and Grimes||2003||Belkin and Niyogi||2003) . 
The objective in manifold learning is to find a subspace or sub-manifold that captures local structure in the 
data which a global method such as PCA cannot. We focus on locality preserving projections (LPP) \He\ 
and Niyogi, 2003) since it can be efficiently applied to out-of-sample data. For all methods we will also 
provide adaptive algorithms to estimate the dimension of the projective linear subspace. In the next sec- 
tion we outline the main computational and statistical considerations underlying our proposed randomized 
approaches to dimension reduction for massive high-dimensional data. 

2.1 Computational considerations 

The main computational tools used are randomized approximate algorithms for matrix factorization. The 
key idea in randomized algorithms for matrix factorization is that annxp matrix with low rank d can be 
approximately factored in time 0(npd) rather than the 0(n 2 p) required for exact factorization. We assume 
that the data has low intrinsic dimensionality and hence d <C n < p. Given the low rank factorization, 
matrix multiplications in the low dimensional space replace those in the ambient space with little loss in 
accuracy of the matrix operations. The randomized methods we will use provide flexible control of the 
degree of accuracy in the final approximation to the exact sample estimates and make explicit the tradeoff 
between computation time and estimation accuracy. 



4 



2.2 Statistical considerations 

A central concept in this paper is that randomized approximation algorithms used for statistical inference 
impose regularization constraints. Thinking of the estimate computed by the randomized algorithm as a sta- 
tistical model helps highlight this idea. Consider two models or estimators, one corresponding to the exact 
(generalized) eigendecomposition and the other corresponding to the randomized algorithm. The numerical 
analysis perspective typically focuses on the discrepancy between the randomized and the exact algorithm 
evaluated on the same data. The goal is to minimize the difference between the estimates, irrespective of 
the data generating process. However, in many practical applications the data is a noisy random sample 
from a population. This suggests, in particular when the interest is in dimension reduction, that the relevant 
error comparison should be between the true subspace that captures information about the population and 
the corresponding algorithmic estimates. A key parameter of the randomized estimators considered here 
is the number of power iterations used to estimate the span of the data. This parameter induces a solution 
path which converges to the solution of the exact method applied to the sample data as the number of it- 
erations increases. Hence, the power iteration parameter controls both the computational tradeoff as well 
as the amount of regularization applied to the sample estimates. Fewer iterations correspond to stronger 
regularization and reduced runtime. We show in Section[3]that very high dimensional data requires a great 
deal of regularization to achieve small out-of-sample errors. This argues for very few iterations for reasons 
of both computational and statistical efficiency. 

2.3 Notation 

Given positive integers p and d with p ^> d, W xd stands for the class of all matrices with real entries of 
dimension p x d, and SS pxp denotes the sub-class of symmetric positive semi-definite p x p matrices. 
For B £ R pxd , span(B) denotes the subspace of R p spanned by the columns of B. A basis matrix for 
a subspace 5 is any full column rank matrix B £ W xd such that 5 = span(_B), where d — dim(S). 
Denote the data matrix by X = (xi, . . . , x n ) T £ R nxp , where each of the n samples Xi is a random draw 
from a p-dimensional probability distribution V x . In the case of supervised dimensions reduction, denote 
the response vector to be Y G R" (quantitative response) or Y G {1, . . . , r} (categorical response with r 
categories), and Y ~ V Y . The data and the response have a joint distribution (X,Y) ~ V XXY - Unless 
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explicitly specified otherwise, assume that both the sample data and the response (for the regression case) 
are centered, so that Y17=i V i = 5^™=i Xi i = f° r a ^ 3 = 1> ■ ■ • >P- 

2.4 Principal Component Analysis 

In this section we provide a detailed description of the randomized algorithm used as an estimator for PCA. 
The randomization ideas we use draw heavily from previous work in randomized numerical analysis ( Halko 
et al. 2009 ; Rokhlin et al. 2008) . Our main algorithmic contribution is based on the observation that the 
data is a noisy sample from a population quantity. This suggests that algorithmic considerations such as 
the number of iterations also impose modeling constraints such as how much regularization is needed to 
accurately estimate the true population subspace. The objective of PCA is given high-dimensional data find 
a set of d p linear combinations of the original p variables 

£j = bfX, j = l,2,...,d, 

which retain as much as possible of the variation in the original data set. The j-th coefficient vector bj = 
(bij, . . . , b P j) satisfies the following 

1. the linear projections £j, j = 1 . . . , d are ordered by decreasing variance: var(£i) > . . . > var(£d); 

2. £i is uncorrelated with for all i ^ j. 

This problem is stably and efficiently solved, for an arbitrary rectangular matrix X £ R nxp and all possible 
values d G {1, . . . , n}, by Singular Value Decomposition (SVD): 

X = USV T , U T U = V T V = I nxn , (1) 
S = diag(4, ...in), Ii>l2>, ... >£n>0. (2) 

The diagonal entries of S are the singular values, sorted in non-decreasing order, according to the the 
amount of captured variance in the directions defined by the corresponding singular vectors. The first 
d orthonormal columns of the matrix V (with the d largest singular values), provide the exact solution 
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A numerically stable and computationally efficient implementation of SVD is provided as part of the 



industry-standard numerical linear algebra package LAPACK i Anderson et al. 1990 1 requiring 0(n 2 p) 
time and 0(np) of memory. When both the number of variables (p) and the number of data points (n) is 
large this runtime cost is prohibitively high. In the next Section we describe how randomized algorithms 
address this problem. 



2.4.1 Randomized PC A 

It is possible to estimate the maximum variance directions, or principal components, without computing 
the full eigendecomposition of the covariance matrix. For example when the covariance matrix is low rank. 
In this case an attractive alternative would be a computationally accurate and efficient approximation of 
the top few eigenvector directions. Randomized matrix factorizations achieve such an approximation by 
coupling random projections with numerically stable factorizations of the resulting lower dimensional ma- 
trix objects. The idea of random projection was first developed as a proof technique to study the distortion 
induced by low dimensional embeddings of high-dimensional vectors in the work of Johnson and Lin-| 
denstrauss ( 1984) with much literature simplifying and sharpening the results (Frankl and Maehara, 1987; 
|Indyk and Motwani|["l998[|Dasgupta and Gupta||2003||Achlioptas'l|2001) . More recently, the theoretical 
computer science and the numerical analysis communities discovered that random projections can be used 
for efficient approximation algorithms for a variety of applications ( Drineas et al. , 2006 , Sarlos , 2006 , Lib- 



erty et al. 
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Boutsidis et al 



2009 



Rokhlin et al. 



2008 



Halko et al. 



20091. Given data X € R nxp the 



following algorithm Randomized SVD computes an approximation of the top d singular values and singular 
vectors of X (Ro khlin et al.| [2008 ; Halko et al. 2009 1. The algorithm proceeds in two stages: (1) efficiently 
identify an orthonormal basis for the range of X that captures the directions of high variation, (2) project 
the data onto this bases and apply SVD. 

Algorithm: Randomized SVD 
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(1) Find orthonormal basis for the range of X; 

(i) Set the number directions: £ = d + A; 

(ii) Generate random matrix: fi G K nx£ with l ~ N(0, 1); 

(hi) Construct blocks: F w = XX T F (i ^ 1) with = XX T Q, for i = 1, ..,£; 
(iv) Factorize blocks: i? = | F (2) | . . . | F (t) ) = QS\ Q T Q = I; 

(2) Project data onto basis and compute SVD; 

(i) Project onto basis: B = X T Q € K px( " ) ; 

(ii) Compute SVD: B = U"EW T , E = diag(cri, . . . , a txe ), U T U = W T W = J; 

(iii) Compute eigenvectors: {/ = Ud ^ , F = QW, E = diag(<7i, . . . , a d ). 

In stage (1) we first set the number of bases we expect to need £. The parameter A is an oversampling 
parameter, we use 12 as suggested in Rokhlin et al. (2008 ). The parameter d is our expectation of the rank 
of the data matrix. We will state an adaptive procedure to estimate d later in this subsection. In step (iii) 
the random projection matrix Q, is applied to powers of XX T to randomly sample linear combinations of 
singular vectors of the data weighted by powers of the singular values 

= (XX T yn = US 2i V T Q, = US 2t n*, where X = USV T . 

ny.1 

The main goal of the power iterations is to increase the decay of the noise portion of the eigen-spectrum 
while leaving the singular vectors unchanged. This is a regularization or shrinkage constraint. Note 
that each column j of corresponds to a draw from a n-dimensional Gaussian distribution: ~ 
iV(0, US 4,l U T ), with the covariance structure more strongly biased towards higher directions of variation 
as i increases. This type of regularization is analogous to the local shrinkage term developed in Poison 



and Scott 



1 2010 1. In step (iv) we combine blocks F^ 1 ' for i — 1, .., t and factorize the collection using a 
QR factorization. Rokhlin et al. (2008 ) suggested the use of the collection to provide numerical stability in 
same spirit as the (blocked) Lanzcos approach (Chapter 9, in Golu b and Van Loan] ( |1996| >). In stage (2) we 
rotate the orthogonal basis Q computed in stage (1) to the canonical eigenvector bases and scale according 
to the corresponding singular values. In step (i) the data is projected onto the low dimensional orthogonal 
basis Q. Step (ii) computes the singular values and vectors of the projected space. The computational 
complexity of the randomization step is 0(t£np) where £ ~ d is the rank of the approximation and t is the 
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number of iterations of the (blocked) Lanzcos step. The factorization step is 0(£np + t 2 £ 2 n). With t and I 
very small relative to n,p the resulting overall complexity is 0(t£np + t 2 £ 2 n). The runtime in both steps 
is dominated by the multiplication by the data matrix and in the case of sparse data fast multiplication can 
further reduce the runtime. We use a normalized version of the above algorithm that has the same runtime 
complexity but is numerically more stable (Martinsson et al. , 20101. The exact algorithm follows: 
Algorithm 1 : Randomized SVD 

input: X £ M™ xp : data matrix, d: number of required top eigenvalue directions, A: oversampling parameter, 
t: number of power iterations output: S G K d : top singular values, U 6 R pxd : left eigenvectors, V € M nXti : 
right eigenvectors Stage 1: Find approximate orthonormal basis for the range of X 

1. Set I = d + A 

2. Generate Q e R nxl s.t. % - N(Q, 1) \fi,j 

3. Set = n.forj = l,...,t 

(a) Set FW> = X T F ( j-^ 

(b) Factorize = Q[ j) S[ j) , Q^ )T Q^ ] = I 

(c) Set = XQ[ 3) 

(d) Factorize F& = Q 2 m ' , Q { 2 j) Q 2 3) = I 

(e) Set F& = Q 2 i] 

4. Factorize R = (F« \ F^ \ ...\ F&) = QS € R nxtl , Q T Q = I 
Stage 2: Project data onto the orthonormal basis Q and do SVD 

1. Project & Factorize [U, S, V] = full SVD(Q T X) [LAPACK] 

2. Set S = diag(5i, . . . , crj), t> = V"[ , 1 : d], U = QU[ , 1 : d] 
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2.4.2 Adaptive method to estimate d 

In the unsupervised setting, for PCA and LPP, we use the expected reconstruction accuracy upon orthogonal 
projection onto the inferred dimension reduction subspace of a held-out subset of the data as the criterion 
to select the true rank of the data d* . Optimizing this criterion in the population corresponds to 

d* = argmin E x || (I - P 6W )X\\l = \\ (I - B^B^XWl 

d£{l,...,d maa: } 

Leave-one-OUt Cross-validation. Leave-one-out cross-validation (LOO-CV) provides an approxi- 
mately unbiased estimate of d* based on the random sample {X»}" =1 . First, for each d a loss function 
based on predictive performance on held out data is defined as 

cv£ = I E IK 7 - ^ = IK 7 - b^b^xS, 
n i=i n i=i 

where B(%\ € K pxd is a basis matrix for the dimension reduction subspace inferred using the 
full data set excluding the i-th sample. 

C-fold Cross-validation. A more computationally efficient estimate that has lower variance but higher 
bias than LOO-CV is provided by c-fold cross-validation (e.g. c = 5 or 10). To simplify notation let 
n — a x c and denote the c-fold cross-validation loss function estimate for a pre-specified value of the 
parameter d to be CV^ M . Randomly partition the input data {Xi}" =1 into c equally sized disjoint subsets 
{X(i)}i = i, where Xfo denotes the j-th sample point within subset i. Then 

i — l j = 1 i— 1 j — 1 

where b!*L G W xd is the orthogonal basis matrix for the dimension reduction subspace inferred using 
the full data set excluding the i-th subset of size a, X^ . We optimize over the full range of allowable 
values for d to arrive at the final estimate 

d*-Mi= argmin CV^ ld . 

de{l,...,d max } 

Runtime Analysis. The overall asymptotic runtime scales linearly with c for a fixed value of the pa- 
rameter k. In more detail, for a fixed value of of d £ {1, . . . , d max } there are two majors computational 
steps: 
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• estimate projective subspace , for i = 1, . . . , c: 0(c X [tl(n — a)p + t 2 l 2 (n — a)]) 

• project data and construct error estimate: 0(lnp) 

Hence, the overall asymptotic runtime is dominated by the projective subspace estimation and is of 0(<i max x 
cx [tl(n— a)p+t 2 l 2 (n— a)]). For small values of d max this is on the order of a single iteration of Algorithm 
□ 

2.4.3 Accuracy of the randomized algorithm 

Assumig the the data matrix is a noisy random draw from the population quantity there are main two 
criteria by which the accuracy of the approximate algorithm can be measured. The first is an estimate of 
the accuracy of the approximate factorization on the observed data matrix. The second criteria is the error 
between the estimated subspace and the subspace that captures the structure of the population model. We 
estimate the second criterion using held-out subset of the data which is not used for estimation. In terms of 
the error in reconstructing the data matrix it has been shown (Rokhlin et al. 2008 ; Halko et al. 2009 1 that 
the randomized SVD algorithm is highly accurate and computationally efficient. In Rokhlin et al. (20081 
the following bound on the reconstruction error of the rank-d approximation to X was given 

\\X-U±V T \\ 2 <Cn 1/<4t) a d+1 , 

where <J(d+i) is the (d + l) s * singular value of X, \\ ■ \ \' 2 denotes the spectral norm, t is the number of 
power iterations, and C is a constant independent of X. In addition, Rokhlin et al. (2008) demonstrated 
through numerical experiments that the theoretical bounds hold in practice with greatly reduced scaling 
constants as compared to the theoretical predictions, independent of the matrix being approximated. From 
an inference perspective a factorization that exactly reconstruct the data matrix X may not be ideal because 
it may be susceptible to overfitting the noise in the data, especially when the variable dimension is high 
relative to the sample size. Hence, rather than optimizing the exact reconstruction accuracy we propose to 
optimize the tradeoff between exactly fitting the sample data and good performance on future data. This 
strategy is implemented using an out-of-sample predictive accuracy criterion to set an appropriate value for 
the regularization parameter t. We will show in Section|3]different scenarios when the dimension reduction 
estimators have better out-of-sample performance for smaller values of t (large shrinkage) than for large t 
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(little shrinkage). In short, poor estimators for the in-sample data matrix can be superior on out-of-sample 
data. 

2.5 Generalized eigendecomposition and dimension reduction 

For each of the other three dimension reduction methods: SIR, LSIR, and LPP, we will provide randomized 
approximate algorithms to address the problems of supervised and unsupervised dimension reduction cap- 
turing non-linear (manifold) structure. Inference for all three methods involves the solution of a generalized 
eigedecomposition problem of the following form 

r<? = AEp, (4) 

where E,T £ SS pxp are symmetric positive semi-definite matrices. Let d -C min(n,p) be the intrinsic 
dimensionality of information contained in the data. Then our main objective is to find a basis for the 
subspace spanned by the generalized eigenvectors {g*, . . . ,g2}. An important structural constraint in all 
three problems is that F is low-rank. It is this constraint that we will take advantage of in the randomized 
methods. 

2.5.1 Supervised dimension reduction 

Dimension reduction is often a first step in further downstream data analysis including data visualization and 
regression. If the ultimate goal is regression then the low dimensional summary Z = R{X) is computed 
from the data for which there exists an accurate predictive model of the future response variables 

Y — f(X) + e — h(Z) + e, X £ R p , Z e R d , d < p. 

Sufficient dimension reduction (SDR) (pl [T99T][Cook and Weisb^[T99Tj pl [T992{[Hastie and Tibshiraml 
[19961 |Goldberger et al.|[2005l|Globerson and Roweis1[200o1[LTeT^[2005][Niisson et al.||2007l|Sugiyama| 



2007 Cook 2007 Wu et al. 20 10 aims to provide a summary R(X), which captures the predictive 
information of X about the response Y . A very useful consequence for predictive modeling is the identity 
E[y | X] = E[y | R(X)], see |Cook|p007] for more details. In this paper we focus on linear SDRs which 
correspond to projections of the data onto a subspace G = (<?i, . . . , gj,) G W xd called the dimension 
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reduction subspace. Hence, the following condition will hold: 

(Y | X) = (Y | G T X), = is equivalence in distribution. 

We will consider two specific supervised dimension reduction methods: Sliced Inverse Regression (SIR) 
( jLij |1991| > and Localized Sliced Inverse Regression (LSIR) l |Wu et afj |2010[ >. SIR is effective when the 
predictive structure in the data is global, i.e. there is single predictive subspace over the support of the 
marginal distribution of X. In the case of local or manifold predictive structure in the data, LSIR can be 
used to compute a projection matrix G that contains this non-linear (manifold) structure. 

Inverse Regression (SIR) Sliced Inverse Regression (SIR) is a dimension reduction approach intro- 
duced by |0|fl99T) . The relevant statistical quantities for SIR are the covariance matrix, E = cov(X), and 
the covariance of the inverse regression, V — cov[Ex(^|^)]. We define the following eigendecomposition 
problem 

Fg t = \ t Zg t , (5) 

and the dimension reduction subspace is G — span(gi, <?d) | Ai, Ad > 0, the eigenvectors corre- 
sponding to non-zero eigenvalues. If there exists a linear projection matrix G with the property that 

span(Ex[AT | Y] - E X [X]) £ span(EG) (6) 

then SIR is effective. Given observations {(xi, 3/i)}" =1 we obtain an estimate of G by computing empirical 
estimates E and T. A standard estimator for the covariance is used E = — x i x f ■ To compute V the 

data is first divided into H groups or slices Sh with observations in each group and then T is computed 

H n h 1 
f = V] — iih\ih \ {ih = — y~] Xj, h = 1, H. 
A — ' n rih — 

h=l n ]GS h 

The empirical version of |5]( is solved to compute G 

Yg l = \ i tg i , G = span(gi, g d ) \ Ai, \ d > 5, (7) 

where 8 > is a fixed threshold. Problems with SIR arise when the statistical assumption in l|6j is not 
satisfied, that is the predictive structure in the data is non-linear. This can be due to the data being concen- 
trated near a manifold or having clustering structure. SIR can be adapted to address this setting by taking 
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into account the local structure of the explanatory variables conditioned on the response variable. The key 
idea behind Localized Sliced Inverse Regression (LSIR) (Wu et al. 2010) is based on the observation in 
manifold learning that Euclidean structure around a data point in W is only useful locally. This suggests 
computing a local version of the covariance of the inverse regression T bc . First we compute a local mean 
for each observation: 

where Pi denotes the set of indexes of the fe-nearest neighbors, considering only the data points with 
response values within the slice to which x, belongs. Given the local means we compute i\ oc as 



f, 




i=l 

and solve the generalized eigendecomposition problem 

T ix gi = Xi'Sgt, G = span(gi, ...,gd) | Ax, Ad > 5. (8) 

For both SIR and LSIR the most straightforward solution to the generalized eigendecompositions in equa- 
tions {7} and is based on the Cholesky decomposition of E and typically requires runtime complexity 
0(np 2 ). In the next subsection we provide an efficient algorithm to compute the generalized eigendecom- 
positions based on the (approximate) SVD of the data matrix X. 

Efficient solutions and approximate SVD Both SIR and LSIR reduce to solving a truncated gener- 
alized eigendecomposition problem 

tg = \tg, (9) 

with G as the subspace spanned by the generalized eigenvectors gi, . . . ,ga, with largest generalized eigen- 
values Ax, ... , Ad, and d is an estimate of the intrinsic dimensionality of the data - the number of pre- 
dictive directions. We propose an alternative to the full Cholesky-based factorization that takes advan- 

~ 1 rp 

tage of the low-rank structure in T. Set = SU as the square root of the spectral factorization 
f = US 2 U T — (f2) T f2. The generalized eigendecomposition problem can be rewritten as the fol- 
lowing symmetric eigendecomposition 

(f~5) T Sf"5e = \e, e = fig. (10) 
A 
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The dimension reduction subspace G will be span of (g± = f _ ^ e\, . . . , gd = f ~ 2 g d ) ; and d is the rank 
of r. If E and Y can be computed efficiently and are (approximately) low-rank then we will show that the 
above computation can be much faster than the Cholesky-based factorization. We first consider the case 
of SIR. Sort the samples in decreasing order of the response and slice the samples into H slices. For each 
slice h we define the row vector Jh = (1 ■ ■ ■ l)ixn h and the matrix 



<Air — 



Ji 







\ 







./// 



' n X tl 



Given J s i r and the data matrix we observe that 



f — X •/„ , L i A . 



By construction rank(f ) < H — 1 <C min(n,p) and f = X T J S1I G R pxff can be constructed in 0(np) 
time. The full eigendecomposition of f 5 is 0(H 2 p). Recall that X = E 1 / 2 so the explicit construction 
and full eigendecomposition of E^f ~2 = XUS + is 0{Hnp + H 2 p). Hence the computation of the 
exact solution to the optimization problem posed by SIR requires 0(H 2 p + Hnp) time. This can be made 



faster by using RSVD approach in Section 2.4.1 When the number of slices is small, H <C min(n, p), the 
time savings could be substantial as compared to a typical Cholesky-based factorization which scales as 
0{n 2 p). We now consider the LSIR case. For each slice h £ {1, . . . , H} we construct a block matrix J 
that is rih x with each entry J* = i if i, j are fc-nearest neighbors (k fixed parameter) and otherwise, 
Ju = %■ We then construct the block diagonal n x n matrix Ji s i r = diag( J%, Jh)- Then 



fioc — X Jj^JlstfX. 



To compute the fc-nearest neighbors we need the pairwise distances between points. We do not do this 
in the p-dimensional data space but use the eigenvector coordinates of the marginal covariance matrix. 
Given the SVD of X constructing f 2 := X T ' Jisir G W xn requires 0(dnp) operations. Computing 



the SVD of X using the RSVD approach in Section 2.4.1 requires 0(tlnp) operations. Similarly an 



approximate eigendecomposition of f 2 , f I , requires 0(t£np) operations. The final step of finding the 



top eigenvalues and eigenvectors of T» 2 E 2 = r, 2 X £ 



using exact SVD and back-transforming 



into the ambient basis takes 0(dnp). This brings the runtime complexity of the approximate solution to 



15 



0((t£ + d)np) versusO(n 2 p) for the standard Cholesky-based approximation. The detailed algorithm is 
stated in Algorithm[2] 



Algorithm 2 : Randomized (L)SIR 



input: X £ M. nxp : data matrix, H: number of slices, fc: number of nearest neighbors [LSIRonly], d: dimen- 
sion of sufficient dimension reduction subspace, A: oversampling parameter for Ranomized SVD [default: 
A = 12], t: number of power iterations for Randomized SVD [default: t=l] output: G(l)sir G K pxd : a 
basis matrix for the effective dimension reduction subspace Stage 1: Estimate low-rank approximation to T 



1. Construct J(L)siRi as described in Section 2.5.1 

2. Set L = X T Jf L)SIR (=> ^(l)sir = LL T ) 

3. Factorize [U, S,V] = RandomizedSVD(L,d, A,t) 
Stage 2: Solve the generalized eigendecomposition 

1. Construct A = SU T X T e R dx P 

2. Factorize [U, S, V] = full SVD(A) [LAPACK] 

3. Back-transform G {l)S ir = USU T e R pxd 



2.5.2 Adaptive method to estimate d 

In the supervised setting, for SIR and LSIR, we use the expected prediction error from kNN regres- 
sion/classification upon orthogonal projection onto the inferred dimension reduction subspace as the cri- 
terion to select the true rank of G, d* . In particular, for regression we use the expected squared prediction 
error estimated by the sums of squares of the predicted residuals as the optimization criterion while for 
classification we select the value for d* that minimizes the expected misclassification accuracy as estimated 
by the observed misclassification performance. Optimizing this criterion in the population corresponds to 

d" = argmin E Y , X \\L(Y, Y (d ' k) )\\, 

d£{l,...,d max } 
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where y (d,fc) = kNN(XG (d) , Y, k) and L(Y, Y {d ' k) ) = ||y-Y (d ' fe) ||l for regression and L(Y, Y (d<k) ) = 
lyjy(d,*0 f° r classification. Here, k is a fixed parameter denoting the number of nearest neighbors used 
for the class estimation and G^ d ' is the estimate of a basis matrix for the edr subspace of dimension d. 

Leave-one-OUt Cross-validation. Similarly to Section |Z4~2| we propose to use the Leave-one-out 
cross-validation (LOO-CV) estimator for d*, based on the random sample {Xi}" =1 . First, for each d the 
expected loss is estimate using held out data 

where = kNN(XG , j^' i j , Y, k) is the k nearest neighbor prediction for Yi, based on projection of the 

data onto estimated reduction subspace inferred using {Yj-_j), X /_£*}, the full data set excluding the i-th 
sample. 

C-fold Cross-validation. In practice, for massive data it is of importance to use a more computa- 
tionally efficient estimate provided by c-fold cross-validation (e.g. c = 5 or 10). To simplify notation let 
n — a x c and denote the c-fold cross-validation loss function estimate for a pre-specified value of the 
parameter d to be Cv£^ ld . Randomly partition the input data {(Xi, Yi)}" =1 into c equally sized disjoint 
subsets {Di := F (i) )}f =1 , where {(X t , Y)}? =1 = Uf =1 A. Then 

i=l 3=1 

where Yij denotes the observed response for the j-th sample within class i and Y^'^j^ is the predicted value 
for the j-th sample point in the i-th subset Di, based on projection of the data onto the estimated reduction 
subspace. The inference is based on the full data set, excluding the i-th subset. We optimize over the full 
range of allowable values for d to arrive at the final estimate 

rfc-fold = argmin CV^ W . 

de{l,...,d max } 

2.5.3 Unsupervised manifold learning 

Dimension reduction based on graph embeddings seek to map the original data points to a lower dimen- 
sional set of points while preserving neighborhood relationships. The theoretical assumptions underlying 



17 



these methods are that the data lies on a smooth manifold embedded in the high-dimensional ambient space. 
This manifold is unknown and needs to be inferred from the data. Given sufficient number of observations 
the manifold can be reasonably represented as a G — (E,V) (Chung, 1997 1 where the vertexes {vi, .., v„} 



correspond to the n observations {x\, .., x n } and the edges corresponds to which points which are close to 
each other. For example, this neighborhood relationship can be encoded in a sparse symmetric adjacency 
matrix W. Given W, the Laplacian Eigenmaps algorithm Belkin and Niyogi (2003) embeds the data into 
a low dimensional space preserving local relationships between points. Given the adjacency or association 
matrix the Graph Laplacian is constructed L — D — W, where D is a diagonal matrix with Da — Wij . 
A spectral decomposition of L 

Lvi = XiVi 

results in eigenvalues Ai = < A2 < ... < A„ with V\ = 1. Projecting the matrix L onto the d 
eigenvectors corresponding to the smallest d eigenvalues greater than zero embeds the n points into a d 
dimensional space. Under certain conditions ( Belkin and Niyogi 2005 1, the Graph Laplacian converges to 
the Laplace-Beltrami operator on the underlying manifold. This provides a theoretical motivation for the 
embedding. This embedding needs to be recomputed when a new data point is introduced and typically 
will not be a linear projection of the data. Hence, for computational reasons it would be advantageous to 
have a linear projection that can be applied to new data points without having to recompute the spectral 
decomposition of the graph Laplacian. The goal of Locality Preserving Projections ( He and Niyogi] |2003| > 
is to provide such a linear approximation to the non-linear embedding of Laplacian Eigenmaps ( Belkin and 
Niyogi, 2003 ). The dimension reduction procedure starts by specifying the dimension of the transformed 
space to be d < n. Let the parameter defining the neighborhood size be k. Locality Preserving Projections 
(LPP) ( He and Niyogi , 2003 ) is stated as the following generalized eigendecomposition problem. 

X T LXe = \X T DXe, 
X T (D - W)Xe = XX T DXe, 

(11) 

X T WXe = (1 - \)X T DXe, 
X T WXe = fiX T Xe, 
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where, X = D*X,W = D 2 WD i , fi = 1 — X and for a fixed bandwidth parameter b, 

{exp { — r Xj - \ if Xi is among the fc-NN of Xj or vice versa 
otherwise. 
The column vectors that are the solutions to equation l | 1 I [ 1 are the required embedding directions {e-,}, 
ordered according to their generalized eigenvalues 1 = fio > fJ-i > ■ ■ ■ > jUd-i. Hence the neighborhood- 
preserving optimal embedding according to the LPP criterion is: 

Xi -> A T yi, where A = (e ,ei, . . . ,e d -i)- 

LPP is obtained from a graph embedding, using a nearest neighbor graph that captures local manifold 
structure. If a complete graph is generated using Euclidean inner products between data points, then a very 
similar result to PCA is produced, emphasizing the global structure of the data (W = XX T ). 

X T WXe = fiX T DXe 
(X T xfe = fiX T DXe. 

In the case when the diagonal matrix D is close to identity matrix, X T DX w X T X the minimum eigen- 
values of equation \12\ correspond to the maximum eigenvalues of 

(X T xfe = fiX T Xe 
X T Xe = fie, 

which is the optimization problem that is solved by PCA. Hence, LPP with a complete inner product graph 
is similar to PCA. The only difference is that the matrix D is used to measure the local density around each 
data point (by its degree on the neighborhood graph) while PCA treats all point equally. 

Randomized Algorithm for LPP The approximation algorithm to solve LPP requires solving the 
generalized eigendecomposition stated by the last equation in derivation [TT| There will be on the order of 
kn non-zero entries. Hence, the matrix W will be sparse, assuming the size of the local neighborhoods 
k <C n. This implies that the matrix product X T WXQ., where Q G R pxi can be efficiently computed 
in 0((k + l)np) time, without explicitly constructing W by using the fc-nearest neighbor matrix. Hence 
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X T WX can be efficiently approximated using RSVD from Section 



2.4.1 



which would provide 



T5 = SU 1 . Then, setting E := X 1 X, we can proceed analogously to Section|2.5.l|to solve the problem 



i — — i 



(12) 



The dimension reduction subspace G will be span of (gi = f 2 ex, ■ ■ ■ , ga = f 2 e d)- The rank d of f 
and can be efficiently estimated using the adaptive procedure stated in Section [2.4.2| 



Algorithm 3 : Randomized LPP 

input: X 6 E™ xp : data matrix, k: number of nearest neighbors capturing local manifold structure, d: di- 
mension of the embedding subspace, A: oversampling parameter for Randomized SVD [default: A = 12], 
t: number of power iterations for Randomized SVD [default: t=l] 

output: Glpp € W xd : a basis matrix for the embedding subspace Stage 1: Estimate low-rank approxima- 
tion to X T WX 

1. Construct X and W, 

2. Factorize [U, S, U] = RandomizedSVD(X T WX, d, A, t) 
Stage 2: Solve the generalized eigendecomposition 

1. Construct A = SU T X T e R dx P 

2. Factorize [U, S, V] = full SVD(A) [LAPACK] 

3. Back-transform G LPP = USU T € R pxd 
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3 Results on real and simulated data 

We use real and simulated data to demonstrate two points: (1) in both the supervised as well as unsupervised 
setting the randomized algorithms are much faster than the exact methods with minimal loss in estimation 
accuracy and (2) in the supervised setting the randomized algorithms are both much more computationally 
efficient and can have better out-of-sample accuracy than the exact methods. The second point is shown to 
be true in the least eigenvalue setting, when the response is associated strongly only with the least important 
principal component l |Hotelling| |T957| [Cox] [1968) | Jolliffe| [T982) . 

3.1 Simulations 

3.1.1 Unsupervised dimension reduction 

We begin with unsupervised dimension reduction and focus on PCA comparing an exact version of SVD 
with the randomized SVD method introduced in Section l2l4l The simulated data will be drawn from a 
model with spiked Wishart covariance structure: 

Xi ~ N(0, diag(cri, . . . , a p )), fori = 1, . . . , n, 

where we will vary the decay in the eigenvalues in the simulations. Our objective is to characterize the 
runtime advantage of the randomized SVD and to assess the relative sample eigenvalue estimation accuracy 
as a function of the number of iterations t. We first explore the effect of the regularization parameter t - the 
number of iterations of the power method. We spike the first twenty eigenvalues ,Oj = 5/ y/n, 1 < j < 20, 
and then model the remaining eigenvalues as noise, Oj — 1/ y'n, 20 < j < p. We set the oversampling 
parameter A = 10 and let t range from 1 to 5. For an input matrix of dimension n = 2000 and p = 1000 
we plot the percent relative accuracy over 10 draws in Table[T] Using the same simulation setting as before 



t 


1 


2 


3 


4 


5 


% relative accuracy 


73.69 ±0.44 


91.01 ±0.39 


97.57 ±0.41 


99.44 ± 0.18 99.S 


8 ±0.05 



Table 1 : Randomized SVD eigenvalue estimation accuracy for different levels of regularization t. 



and fixing the regularization parameter to be t = 3 we computed the ratio of the runtime of the exact 
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SVD over the randomized SVD: r = 



exact SVD 



. Based on 10 random data sets we observe that 



randomized SVD 



r = {12.43 ± 0.14, 17.13 ± 0.11, 24.99 ± 0.27} for p = {1000, 1500, 2000} respectively. Notice the 
gain increases as the matrix becomes more square - the smaller dimension approaches the larger dimension 
as is the case for massive high-dimensional data. To study the comparison of the exact and randomized 
SVD when there are two eigen-gaps we set the eigenvalues to be Oj — lO/y'n, 1 < j < 10, <Tj = 
10 < j < 20, and Oj — 1/y/n, 20 < j < p. The sample size and the dimension are is fixed to be 
n = 2000, p — 1000. We fix the oversampling A = 10 and the number of iterations t = 2. Over 10 draws 
the relative estimation accuracy were 99.88 ± 0.02 for the eigenvalues above the first gap, 98.34 ± 0.24 for 
the eigenvalues between the first and second gap, and 93.06 ± 0.27 for the eigenvalues below the second 
gap. The eigenvalue estimates rapidly converge for the top two regimes. 

3.1.2 Supervised dimension reduction of a latent factor regression model 

In this section we explore the accuracy of randomized dimension reduction in the setting of factor regres- 
sion. The modeling assumption is that the response variable Y is strongly correlated with a few directions in 
covariate space X. We compare six methods for estimation of the predictive subspace: SIR as implemented 
in |Weisb erg (20021 is denoted as dr. sir, SIR as implement by the (randomized) Algorithm[2]is denoted as 
rand. sir, LSIR as implemented in |Wu et al7| ( |2010| > is denoted as lsir, LSIR as implemented by the (random- 
ized) Algorifhm[2]is denoted as rand.lsir, PCA denoted pea, and (randomized) PCA as implemented by the 
(randomized) Algorifhm[T](rand.pca). We will use a latent factor regression model ( West , 2003 1 to simulate 
the data used to evaluate the accuracy and speed of the six methods. The rationale behind using a latent 
factor regression model is that we have explicit control of the direction of variation in covariates space X 
which is strongly correlated with the response Y. The latent factor model corresponds to the following 
decomposition 



Yi = 



Af e + u 



X, = 



BSXi + Vi 
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with d < p, B G 



B T B — Id, and S = diag(si, . . . , Sd). The random variables \i, v iy e t are 



assumed to be independent and normally distributed 



^ h ^ 



N(o,n v 



r 2 
A 2 



Under the above model as the noise in the covariates decreases (A goes to zero) we obtain 

Y % | x % ~ N(xf BS~ 1 6, r 2 ), 
which corresponds to the principal components regression model 

Yi | Xi ~N(a;*6l*,r 2 ), with x* = XiB G K d and 6* = 5 _1 6». 

The dependence between X; and Yi is induced by marginalizing the latent factors A^. The joint distribution 
for (Yi, Xi) is normal 



Yi 



N(0,E), E : 



^ BS 2 B T + A 2 BS9 S 



e'SB 1 



as is the conditional 



Yi | Xi,0,S,~ N (e T SB T Cx u t 2 + 9 T (I- SB T CBS)6^ , C = (BS 2 B T + A 2 )" 1 G E p . 

For the above model we will use two distinct simulation settings. In the first setting the dominant directions 
of variation in the covariates will correlate with the regression coefficient. In the second setting directions 
explaining less variation in the covariates will correlate with the regression coefficients - this is the least 
eigenvalue scenario. In both settings we set the number of factors d to be 10. The parameters {#;} and 
{si} are drawn from a t-distribution with 5 degrees of freedom. The two settings correspond to ordering 
the sampled parameters such that 

1. > \9 2 \ > ... > \6 d \ and|si| > \s 2 \ > ... > \s d \ 

2. < \6 2 \ < ... < \9 d \ and|si| > \s 2 \ > ... > \s d \. 

The first setting orders the regression coefficients with the maximal eigenvalues. The second setting assigns 
the maximal regression coefficient to the smallest eigenvalue. The covariance structure is set to be spherical 
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A 2 = tp 2 I. The parameters ip 2 and r 2 control the percentage of variance explained or signal-to-noise (we 
set this to 0.8) 



0.8 = &2n x = — r— . , 0.8 = s2n y - - — . 

i/< 2 +min(s 2 ) t 2 + 6 t 6 Var(Yj) 

We used four criteria to evaluate the projections we obtained from the six methods. The first was CPU 
time which includes pre-processing the inputs, estimation, and post-processing of the results - denoted as 
time. The second was the absolute value of the correlation of the dimension reduction space which in our 
simulations was the vector 

b = S Y \x = spanfCovpO^Covpf, Y)] = span[{BS 2 B T + A^BSG]. 

We report the absolute correlation (AEDR) of b with the effective dimension reduction estimate b, |corr(6, 6)|. 
The third and fourth metric are based on predictive criteria. In the first case the criterion used is an estimate 
of the mean square prediction error (MSPE) 

E\\Y-Zb\\l, 

where Z is the projection of the data X onto the edr subspace and b are the regression coefficients estimates. 
The fourth metric is the proportion of the variance explained by the linear regression (R 2 \ Cor(Y,f) 2 , 
where Y = Zb. We generated 10 data sets. The performance on the different evaluation criteria was 
estimated using 5-fold cross-validation. For each of the two parameter settings we examined two data size 
regimes, n > p and p > n. For the LSIR and SIR algorithms the number of slices is set to ten, H — 10, 
and for LSIR the nearest neighbor parameter is set to five, k = 5. The rank of edr subspace is set for 
all algorithms to d = 1. For the randomized SVD we use t — 1 power iterations and A = 5 for the 
oversampling parameter. Table [2] and [3] report the results for the various methods in the setting where the 
signal is correlated to the top principal components and either n > p or n < p, respectively. Table [4] 
and|5]report the results for the various methods in the setting where the signal is correlated to the bottom 
principal components and either n > p or n < p, respectively. As expected SIR runs much faster than 
all other approaches since the size of the matrix decomposed is equal to the number of slices H which is 
much smaller than n and p. The runtime for rand.lsir tends to be comparable to SIR and is much faster than 
the other methods. It is known that LSIR does not perform well without adding a regularization term (Wu 
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|et al.||20To| > and we reproduce this behavior. It is interesting that rand.lsir does not share this problem and 
does perform well without regularization, the randomization adds an implicit regularization in this case. In 
the setting where the regression signal is embedded into the bottom principal components randomization in 
LSIR has a strong effect in improving the predictive accuracy. 



Method 


Time 


R 2 


MSPE 


AEDR 


dr. sir 


15.13 ± 0.26 


0.52 ±0.02 


2.98 ±0.10 


0.08 ±0.04 


rand. sir 


0.06 ±0.00 


0.78 ±0.02 


1.06 ±0.05 


0.50 ±0.21 


lsir 


19.72 ±0.08 


0.22 ±0.02 


5.04 ±0.40 


0.04 ±0.02 


rand.lsir 


0.08 ±0.01 


0.73 ±0.08 


1.33 ±0.33 


0.37 ±0.16 


pea 


13.23 ±0.03 


0.39 ±0.15 


2.99 ±0.68 


0.37 ±0.17 


rand.pca 


0.43 ±0.00 


0.39 ±0.15 


2.99 ±0.68 


0.37 ±0.17 



Table 2: Performance evaluation of dimension reduction approaches when the signal for regression is cor- 
related with the top principal components. The simulation is based on 10 random datasets of dimension 
n = 2000, p = 1000. 



Method 


Time 


I? 2 


MSPE 


AEDR 


rand, sir 


0.06 ±0.01 


0.77 ±0.05 


1.18 ±0.26 


0.20 ±0.08 


lsir 


9.38 ±0.30 


0.03 ±0.03 


4.83 ±0.34 


0.01 ±0.01 


rand.lsir 


0.08 ±0.00 


0.59 ±0.19 


2.15 ±0.96 


0.08 ±0.08 


pea 


13.78 ±0.02 


0.41 ±0.14 


2.99 ±0.77 


0.31 ±0.16 


rand.pca 


0.41 ± 0.00 


0.41 ±0.14 


2, 99 ±0.77 


0.31 ±0.16 



Table 3: Performance evaluation of dimension reduction approaches when the signal for regression is cor- 
related with the top principal components. The simulation is based on 10 random datasets of dimension 
ri = 1000, p = 2000. 



25 



Method 


Time 


R 2 


MSPE 


AEDR 


dr. sir 


15.14 ±0.23 


0.40 ±0.03 


2.93 ±0.10 


0.33 ±0.02 


rand. sir 


0.06 ±0.00 


0.38 ±0.14 


2.24 ±0.43 


0.22 ±0.09 


lsir 


19.69 ±0.10 


0.14 ±0.07 


3.94 ±0.29 


0.14 ±0.05 


rand.lsir 


0.08 ±0.00 


0.16 ±0.20 


3.06 ±0.83 


0.36 ±0.22 


pea 


13.24 ±0.06 


0.01 ±0.01 


3.59 ±0.27 


0.00 ±0.01 


rand.pca 


0.43 ±0.01 


0.01 ±0.01 


3.59 ±0.27 


0.00 ±0.01 



Table 4: Performance evaluation of dimension reduction approaches when the signal for regression is corre- 
lated with the bottom principal components. The simulation is based on 10 random datasets of dimension 
n = 1000, p = 2000. 



Method 


Time 


I? 2 


MSPE 


AEDR 


rand. sir 


0.06 ±0.00 


0.09 ±0.04 


3.44 ±0.47 


0.09 ±0.04 


lsir 


9.46 ±0.19 


0.04 ±0.03 


3.53 ±0.39 


0.05 ±0.02 


rand.lsir 


0.08 ±0.01 


0.16 ±0.03 


3.21 ±0.45 


0.15 ±0.05 


pea 


13.80 ±0.02 


0.01 ±0.00 


3.65 ±0.46 


0.01 ±0.01 


rand.pca 


0.42 ±0.01 


0.01 ±0.00 


3.65 ±0.46 


0.01 ±0.01 



Table 5: Performance evaluation of dimension reduction approaches when the signal for regression is corre- 
lated with the bottom principal components. The simulation is based on 10 random datasets of dimension 
n = 2000, p = 1000. 

3.1.3 Classification example 

This section demonstrates the performance of the dimension reduction approaches on the XOR classifi- 
cation example from ( |Wu et afj |201(ty , Section 4, (Fig 1). The dimension reduction subspace is two- 
dimensional, symmetric, with clustering structure. SIR is expected to fail to recover the edr subspace 
because _E[X|Y] C Sy\x - the edr subspace inferred by SIR in this case is degenerate. Figure fTland 
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[2] illustrate the results when n > p and p > n, respectively. As expected, SIR fails to recover a two- 
dimensional subspace, while both LSIR and rand.lsir provide reasonable separation when n > p. Only 
LSIR with randomization performs well in the more difficult setting when p > n - in (Wu et af||2010) , 
Secton 4, LSIR with regularization performed well in this same p > n however rand.lsir performs well 
even without adding a regularization term. 



Figure 1: XOR classification example. n=3000, p=1000 
dr.sir(n=3000,p=1 000) sir(n=3000,p=1 000) 



CD 




-1 1 -0.10 0.00 0.05 0.10 

edii edr! 



lsir(n=3000,p=1 000) rand.lsir(n=3000,p=1 000) 




— i 1 1 1 1 — i n 1 1 1 r~ 

-0.04 -0.02 0.00 0.02 0.04 -0.04 -0.02 0.00 0.02 0.04 

edii edr! 
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Figure 2: XOR classification example. n=1000, p=3000 
sir(n=1 000,p=3000) lsir(n=1 000,p=3000) 




-0.06 -0.02 0.02 0.04 0.06 



3.2 Real data 
3.2.1 Digit Recognition 

The MNIST data set (Y. LeCun, htpp://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwrit- 
ten grey-scale digits of dimension p = 28 x 28 = 728. This data set has been extensively studied in the 
past and has been found to contain non-linear structure. The dimension reduction methods we compare are 
PCA, rand.pca, SIR, LSIR, rand.lsir, and random projections (rand.proj) on this data set. We use classifi- 
cation accuracy as our metric and use a 5-nearest neighbor classifier in the edr space as the classification 
algorithm. We set the dimension of the edr for all the methods to d = 20. For the randomized methods 
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the oversampling parameter A = 5 and the number of iterations t = 1. For LSIR we set the number of 
nearest neighbors k — 5. The training and test sets consist of 100 randomly selected images of each digit, 
respectively. The performance results are summarized in Table [6] As in the previous section we see that 
the randomized supervised methods outperform their exact analogs. We suspect this is again due to their 
implicit regularization. Note that LSIR properly tuned will do better than SIR on this classification task 



( |Wu et ai"}|2010} - in our implementation of LSIR we did not include the regularization parameter. 



Digits 





1 


2 


3 


4 


5 


6 


7 


8 


9 


pea 


90.0 ± 1.6 


98.9 ±0.7 


88.6 ±3.5 


86.3 ±4.4 


85.0 ±5.2 


85.3 ±1.9 


95.1 ± 2.4 


90.0 ±2.5 


80.4 ± 4.0 


83.2 ±4.5 


rand. pea 


96.0 ±1.8 


98.8 ±0.8 


88.6 ±3.6 


86.9 ±4.0 


83.9 ±5.5 


86.1 ±3.1 


95.5 ± 2.5 


89.5 ±2.7 


80.5 ±3.7 


84.0 ± 3.6 


dr. sir 


91.2 ±1.9 


97.7 ± 1.4 


51.3 ± 7.6 


67.6 ±7.5 


70.8 ± 5.7 


56.6 ±7.5 


81.5 ±2.8 


71.8 ±5.9 


55.0 ± 4.9 


72.2 ±6.2 


sir 


90.0 ± 3.4 


96.6 ±2.0 


73.3 ± 4.9 


76.2 ±3.8 


82.9 ± 2.9 


67.7 ±3.2 


87.0 ±3.8 


80.4 ±4.1 


69.2 ±5.0 


74.4 ± 5.4 


lsir 


75.5 ± 5.0 


95.7 ± 1.6 


38.2 ± 7.5 


52.4 ±8.1 


52.2 ±4.4 


39.6 ±6.2 


61.0 ±5.4.8 


58.3 ±6.2 


36.3 ±4.1 


56.2 ±6.5 


rand.lsir 


92.2 ± 3.0 


98.7 ± 1.3 


81.9 ± 5.7 


82.1 ±4.2 


78.9 ± 2.7 


76.4 ±6.9 


89.1 ± 3.9 


82.2 ±2.9 


66.9 ± 3.8 


78.8 ± 5.8 


rand.proj 


86.1 ±3.8 


98.6 ±1.3 


54.1 ± 10.1 


66.2 ±8.6 


65.9 ± 7.6 


47.9 ±7.6 


77.4 ±6.9 


77.7 ±5.3 


43.4 ±5.7 


61.9 ± 7.0 



Table 6: Digit recognition (% accuracy). 100 train and 100 test examples, randomly sampled for each digit. 
3.2.2 Tumor classification 

The gene expression data on 198 tumor samples from Ramaswa my et aL] ( |2001| l, including 14 common 
tumor types, was used in conjunction with 5NN classifier to evaluate the different dimension reduction 
approaches. The full set of 16,063 gene and expressed sequence tags, without any additional pre-processing, 
was used for the purposes of the comparison. The predictive accuracy was estimated as the average over 
10 random binary train/test partitions of the data, requiring at least 5 samples from each class to be present 
in both the train and the test set. For the randomized methods we set the oversampling parameter A = 5, 
the number of power iterations t = 1, and the number of nearest neighbors k — 5. The number of edr 
directions we used were d = 12. Table [7] states the results for the various dimension reduction methods. 
Note that SIR and rand.lsir outperformed the other approaches. 
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Method 


pea 


rand.pca sir 


lsir 


rand. lsir 


rand.prqj 


mean 


45.15 ±4.97 


45.15 ± 5.45 57.78 ±3.95 


11.62 ±2.20 


54.14 ±4.45 


39.60 ±4.94 



Table 7: Gene expression data from the study of 14 common tumor types Ramaswamy et al. (2001 1. The 
input data set was randomly Results were generated based on 10 random splits into train and test set and 
reporting the average. 

4 Discussion 

We adapted randomized SVD to improve the runtime of a variety of dimension reduction methods. The 
main algorithmic contribution of our work is to develop very fast randomized algorithms for a class of gen- 
eralized eigendecompositions which arise in a variety of dimension reduction methods. From an inference 
perspective we observe that randomization imposes implicit regularization on the resulting estimators. This 
is highlighted in the supervised dimension reduction case corresponding to the least eigenvalue situation - 
the regression signal is contained in the eigenvectors corresponding to small eigenvalues. The exact role of 
the number of iterations t as a regularization parameter is an open question. As t gets large the subspace of 
the approximated algorithm converges very quickly to the subspace of the exact algorithm. Hence, in order 
to apply the proposed algorithms to massive data sets we need t to be small. A statistical argument for few 
iterations can also be made based on a similar idea to ridge shrinkage. The main effect of the randomized 
algorithm is a reweighting of the eigenvalues = U S 2t V T Q, where the reweighting comes from S 2t . 
If t is small then the eigenvalues are shrunk much less, this is similar in spirit to flattening the eigenvalues. 
This allows for more mixing between the eigenvectors and much as ridge regression it adds "signal" to 
the lower eigenvalues. This is related to the classic least eigenvalue problem (Hotelling, 1957; Cox] 1 1968] ) 
where the main regression signal is in the principal components corresponding to the smaller eigenvalues. 
Both the randomized algorithm with small t and ridge regularization will help address the least eigenvalue 
problem by boosting the weights of the directions corresponding to smaller eigenvalues. It would be of 
great interest to develop some theoretical analysis of this setting. 
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